
Nonlinear Constrained Optimization - Equality

Problems with Equality Constraints

The class of optimization problems we analyze here is

$$\begin{aligned} \operatorname{minimize} \quad & f(\boldsymbol{x}) \\ \text{subject to} \quad & \boldsymbol{h}(\boldsymbol{x})=\mathbf{0} \end{aligned}$$

where $\boldsymbol{x} \in \mathbb{R}^{n}$, $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$, $\boldsymbol{h}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$, $\boldsymbol{h}=\left[h_{1}, \ldots, h_{m}\right]^{\top}$, and $m \leq n$. We assume that the function $\boldsymbol{h}$ is continuously differentiable; that is, $\boldsymbol{h} \in \mathcal{C}^{1}$.

Definition

A point $\boldsymbol{x}^{*}$ satisfying the constraints $h_{1}\left(\boldsymbol{x}^{*}\right)=0, \ldots, h_{m}\left(\boldsymbol{x}^{*}\right)=0$ is said to be a regular point of the constraints if the gradient vectors $\nabla h_{1}\left(\boldsymbol{x}^{*}\right), \ldots, \nabla h_{m}\left(\boldsymbol{x}^{*}\right)$ are linearly independent.

Let $D\boldsymbol{h}\left(\boldsymbol{x}^{*}\right)$ be the Jacobian matrix of $\boldsymbol{h}=\left[h_{1}, \ldots, h_{m}\right]^{\top}$ at $\boldsymbol{x}^{*}$, given by

$$D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)=\left[\begin{array}{c} D h_{1}\left(\boldsymbol{x}^{*}\right) \\ \vdots \\ D h_{m}\left(\boldsymbol{x}^{*}\right) \end{array}\right]=\left[\begin{array}{c} \nabla h_{1}\left(\boldsymbol{x}^{*}\right)^{\top} \\ \vdots \\ \nabla h_{m}\left(\boldsymbol{x}^{*}\right)^{\top} \end{array}\right].$$

Then, $\boldsymbol{x}^{*}$ is regular if and only if $\operatorname{rank} D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)=m$ (i.e., the Jacobian matrix is of full rank).
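Numerically, regularity is just a rank check on the Jacobian. The following minimal Python sketch illustrates this, assuming the circle constraint $h(\boldsymbol{x})=x_{1}^{2}+x_{2}^{2}-1$ (an illustrative choice, not an example from the text):

```python
import numpy as np

# Illustrative constraint (not from the text): the unit circle in R^2,
# so n = 2 and m = 1.
def h(x):
    return np.array([x[0]**2 + x[1]**2 - 1.0])

def Dh(x):
    # 1 x 2 Jacobian of h; its single row is grad h(x)^T.
    return np.array([[2.0 * x[0], 2.0 * x[1]]])

x_star = np.array([1.0, 0.0])      # a point on the surface: h(x_star) = 0
print(h(x_star))                   # [0.]
m = Dh(x_star).shape[0]
print(np.linalg.matrix_rank(Dh(x_star)) == m)   # True: x_star is regular
```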

The set of equality constraints $h_{1}(\boldsymbol{x})=0, \ldots, h_{m}(\boldsymbol{x})=0$, $h_{i}: \mathbb{R}^{n} \rightarrow \mathbb{R}$, describes a surface

$$S=\left\{\boldsymbol{x} \in \mathbb{R}^{n}: h_{1}(\boldsymbol{x})=0, \ldots, h_{m}(\boldsymbol{x})=0\right\}.$$

Assuming that the points in $S$ are regular, the dimension of the surface $S$ is $n-m$. For example, the unit sphere $\left\{\boldsymbol{x} \in \mathbb{R}^{3}: x_{1}^{2}+x_{2}^{2}+x_{3}^{2}-1=0\right\}$ has $n=3$ and $m=1$, and hence dimension $2$.

Tangent and Normal Spaces

In this section we discuss the notions of a tangent space and a normal space at a point on a surface. We begin by defining a curve on a surface $S$.

Definition

A curve $C$ on a surface $S$ is a set of points $\{\boldsymbol{x}(t) \in S: t \in (a, b)\}$, continuously parameterized by $t \in(a, b)$; that is, $\boldsymbol{x}:(a, b) \rightarrow S$ is a continuous function.

A graphical illustration of the definition of a curve is given in Figure 20.4. The definition of a curve implies that all the points on the curve satisfy the equation describing the surface. The curve $C$ passes through a point $\boldsymbol{x}^{*}$ if there exists $t^{*} \in(a, b)$ such that $\boldsymbol{x}\left(t^{*}\right)=\boldsymbol{x}^{*}$.

Intuitively, we can think of a curve $C=\{\boldsymbol{x}(t): t \in(a, b)\}$ as the path traversed by a point $\boldsymbol{x}$ traveling on the surface $S$. The position of the point at time $t$ is given by $\boldsymbol{x}(t)$.

Definition

The curve $C=\{\boldsymbol{x}(t): t \in(a, b)\}$ is differentiable if

$$\dot{\boldsymbol{x}}(t)=\frac{d \boldsymbol{x}}{d t}(t)=\left[\begin{array}{c} \dot{x}_{1}(t) \\ \vdots \\ \dot{x}_{n}(t) \end{array}\right]$$

exists for all $t \in(a, b)$.

Figure 20.4 Curve on a surface.

Figure 20.5 Geometric interpretation of the differentiability of a curve.

The curve $C=\{\boldsymbol{x}(t): t \in(a, b)\}$ is twice differentiable if

$$\ddot{\boldsymbol{x}}(t)=\frac{d^{2} \boldsymbol{x}}{d t^{2}}(t)=\left[\begin{array}{c} \ddot{x}_{1}(t) \\ \vdots \\ \ddot{x}_{n}(t) \end{array}\right]$$

exists for all $t \in(a, b)$.

Note that both $\dot{\boldsymbol{x}}(t)$ and $\ddot{\boldsymbol{x}}(t)$ are $n$-dimensional vectors. We can think of $\dot{\boldsymbol{x}}(t)$ and $\ddot{\boldsymbol{x}}(t)$ as the velocity and acceleration, respectively, of a point traversing the curve $C$ with position $\boldsymbol{x}(t)$ at time $t$. The vector $\dot{\boldsymbol{x}}(t)$ points in the direction of the instantaneous motion of $\boldsymbol{x}(t)$. Therefore, the vector $\dot{\boldsymbol{x}}\left(t^{*}\right)$ is tangent to the curve $C$ at $\boldsymbol{x}^{*}$ (see Figure 20.5).
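As a concrete sketch, consider the unit circle in $\mathbb{R}^{2}$ with the standard parameterization (an illustrative example, not from the text); the velocity and acceleration are the componentwise derivatives:

```python
import numpy as np

# Illustrative curve: x(t) = [cos t, sin t] traces the unit circle
# (the surface h(x) = x1^2 + x2^2 - 1 = 0).
def x(t):
    return np.array([np.cos(t), np.sin(t)])

def x_dot(t):      # velocity: componentwise first derivative of x(t)
    return np.array([-np.sin(t), np.cos(t)])

def x_ddot(t):     # acceleration: componentwise second derivative of x(t)
    return np.array([-np.cos(t), -np.sin(t)])

t_star = np.pi / 4
print(np.isclose(x(t_star) @ x_dot(t_star), 0.0))   # True: velocity is tangent to the circle
print(np.allclose(x_ddot(t_star), -x(t_star)))      # True: acceleration points inward
```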

We are now ready to introduce the notion of a tangent space. For this, recall the set

$$S=\left\{\boldsymbol{x} \in \mathbb{R}^{n}: \boldsymbol{h}(\boldsymbol{x})=\mathbf{0}\right\},$$

where $\boldsymbol{h} \in \mathcal{C}^{1}$. We think of $S$ as a surface in $\mathbb{R}^{n}$.

Definition

The tangent space at a point $\boldsymbol{x}^{*}$ on the surface $S=\left\{\boldsymbol{x} \in \mathbb{R}^{n}: \boldsymbol{h}(\boldsymbol{x})=\mathbf{0}\right\}$ is the set $T\left(\boldsymbol{x}^{*}\right)=\left\{\boldsymbol{y}: D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right) \boldsymbol{y}=\mathbf{0}\right\}$.

Note that the tangent space $T\left(\boldsymbol{x}^{*}\right)$ is the nullspace of the matrix $D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)$:

$$T\left(\boldsymbol{x}^{*}\right)=\mathcal{N}\left(D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)\right).$$

The tangent space is therefore a subspace of $\mathbb{R}^{n}$.

Assuming that $\boldsymbol{x}^{*}$ is regular, the dimension of the tangent space is $n-m$, where $m$ is the number of equality constraints $h_{i}\left(\boldsymbol{x}^{*}\right)=0$. Note that the tangent space passes through the origin. However, it is often convenient to picture the tangent space as a plane that passes through the point $\boldsymbol{x}^{*}$. For this, we define the tangent plane at $\boldsymbol{x}^{*}$ to be the set

$$T P\left(\boldsymbol{x}^{*}\right)=T\left(\boldsymbol{x}^{*}\right)+\boldsymbol{x}^{*}=\left\{\boldsymbol{x}+\boldsymbol{x}^{*}: \boldsymbol{x} \in T\left(\boldsymbol{x}^{*}\right)\right\}.$$
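Since $T\left(\boldsymbol{x}^{*}\right)$ is the nullspace of $D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)$, a basis for it can be computed numerically. A minimal sketch, assuming the unit-sphere constraint $h(\boldsymbol{x})=x_{1}^{2}+x_{2}^{2}+x_{3}^{2}-1$ in $\mathbb{R}^{3}$ (so $n=3$, $m=1$) and using `scipy.linalg.null_space`:

```python
import numpy as np
from scipy.linalg import null_space

def Dh(x):
    # 1 x 3 Jacobian of the sphere constraint h(x) = ||x||^2 - 1
    return np.array([2.0 * x])

x_star = np.array([0.0, 0.0, 1.0])    # the "north pole" of the sphere
T_basis = null_space(Dh(x_star))      # columns: orthonormal basis of T(x_star)
print(T_basis.shape[1])               # 2 = n - m, the dimension of the tangent space
print(np.allclose(Dh(x_star) @ T_basis, 0.0))   # True: each basis vector y solves Dh y = 0
```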

Figure 20.6 Tangent plane to the surface $S$ at the point $\boldsymbol{x}^{*}$.

Theorem

Suppose that $\boldsymbol{x}^{*} \in S$ is a regular point and $T\left(\boldsymbol{x}^{*}\right)$ is the tangent space at $\boldsymbol{x}^{*}$. Then, $\boldsymbol{y} \in T\left(\boldsymbol{x}^{*}\right)$ if and only if there exists a differentiable curve in $S$ passing through $\boldsymbol{x}^{*}$ with derivative $\boldsymbol{y}$ at $\boldsymbol{x}^{*}$.

Proof

$\Leftarrow$: Suppose that there exists a curve $\{\boldsymbol{x}(t): t \in(a, b)\}$ in $S$ such that $\boldsymbol{x}\left(t^{*}\right)=\boldsymbol{x}^{*}$ and $\dot{\boldsymbol{x}}\left(t^{*}\right)=\boldsymbol{y}$ for some $t^{*} \in(a, b)$. Then,

$$\boldsymbol{h}(\boldsymbol{x}(t))=\mathbf{0}$$


for all $t \in(a, b)$. If we differentiate the function $\boldsymbol{h}(\boldsymbol{x}(t))$ with respect to $t$ using the chain rule, we obtain

$$\frac{d}{d t} \boldsymbol{h}(\boldsymbol{x}(t))=D \boldsymbol{h}(\boldsymbol{x}(t)) \dot{\boldsymbol{x}}(t)=\mathbf{0}$$

for all $t \in(a, b)$. Therefore, at $t^{*}$ we get

$$D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right) \boldsymbol{y}=\mathbf{0},$$

and hence $\boldsymbol{y} \in T\left(\boldsymbol{x}^{*}\right)$.

$\Rightarrow$: The proof of this direction relies on the implicit function theorem; we omit the details.

We now introduce the notion of a normal space.

Definition

The normal space $N\left(\boldsymbol{x}^{*}\right)$ at a point $\boldsymbol{x}^{*}$ on the surface $S=\left\{\boldsymbol{x} \in \mathbb{R}^{n}: \boldsymbol{h}(\boldsymbol{x})=\mathbf{0}\right\}$ is the set $N\left(\boldsymbol{x}^{*}\right)=\left\{\boldsymbol{x} \in \mathbb{R}^{n}: \boldsymbol{x}=D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)^{\top} \boldsymbol{z},\ \boldsymbol{z} \in \mathbb{R}^{m}\right\}$.

We can express the normal space $N\left(\boldsymbol{x}^{*}\right)$ as

$$N\left(\boldsymbol{x}^{*}\right)=\mathcal{R}\left(D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)^{\top}\right),$$

that is, the range of the matrix $D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)^{\top}$. Note that the normal space $N\left(\boldsymbol{x}^{*}\right)$ is the subspace of $\mathbb{R}^{n}$ spanned by the vectors $\nabla h_{1}\left(\boldsymbol{x}^{*}\right), \ldots, \nabla h_{m}\left(\boldsymbol{x}^{*}\right)$; that is,

$$\begin{aligned} N\left(\boldsymbol{x}^{*}\right) &=\operatorname{span}\left[\nabla h_{1}\left(\boldsymbol{x}^{*}\right), \ldots, \nabla h_{m}\left(\boldsymbol{x}^{*}\right)\right] \\ &=\left\{\boldsymbol{x} \in \mathbb{R}^{n}: \boldsymbol{x}=z_{1} \nabla h_{1}\left(\boldsymbol{x}^{*}\right)+\cdots+z_{m} \nabla h_{m}\left(\boldsymbol{x}^{*}\right),\ z_{1}, \ldots, z_{m} \in \mathbb{R}\right\}. \end{aligned}$$

Figure 20.9 Normal space in R3\mathbb{R}^{3}.

Note that the normal space contains the zero vector. Assuming that $\boldsymbol{x}^{*}$ is regular, the dimension of the normal space $N\left(\boldsymbol{x}^{*}\right)$ is $m$. As in the case of the tangent space, it is often convenient to picture the normal space $N\left(\boldsymbol{x}^{*}\right)$ as passing through the point $\boldsymbol{x}^{*}$ (rather than through the origin of $\mathbb{R}^{n}$). For this, we define the normal plane at $\boldsymbol{x}^{*}$ as the set

$$N P\left(\boldsymbol{x}^{*}\right)=N\left(\boldsymbol{x}^{*}\right)+\boldsymbol{x}^{*}=\left\{\boldsymbol{x}+\boldsymbol{x}^{*} \in \mathbb{R}^{n}: \boldsymbol{x} \in N\left(\boldsymbol{x}^{*}\right)\right\}.$$

Figure 20.9 illustrates the normal space and normal plane in $\mathbb{R}^{3}$ (i.e., $n=3$ and $m=1$).

We now show that the tangent space and normal space are orthogonal complements of each other (see Section 3.3).

Lemma

We have $T\left(\boldsymbol{x}^{*}\right)=N\left(\boldsymbol{x}^{*}\right)^{\perp}$ and $T\left(\boldsymbol{x}^{*}\right)^{\perp}=N\left(\boldsymbol{x}^{*}\right)$.

Proof

By definition of $T\left(\boldsymbol{x}^{*}\right)$, we may write

$$T\left(\boldsymbol{x}^{*}\right)=\left\{\boldsymbol{y} \in \mathbb{R}^{n}: \boldsymbol{x}^{\top} \boldsymbol{y}=0 \text{ for all } \boldsymbol{x} \in N\left(\boldsymbol{x}^{*}\right)\right\}.$$

Hence, by definition of $N\left(\boldsymbol{x}^{*}\right)$, we have $T\left(\boldsymbol{x}^{*}\right)=N\left(\boldsymbol{x}^{*}\right)^{\perp}$. The second identity follows by taking orthogonal complements: because $N\left(\boldsymbol{x}^{*}\right)$ is a subspace, $T\left(\boldsymbol{x}^{*}\right)^{\perp}=\left(N\left(\boldsymbol{x}^{*}\right)^{\perp}\right)^{\perp}=N\left(\boldsymbol{x}^{*}\right)$.
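The lemma can be checked numerically: a basis of $T\left(\boldsymbol{x}^{*}\right)$ (the nullspace of $D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)$) is orthogonal to a basis of $N\left(\boldsymbol{x}^{*}\right)$ (the columns of $D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)^{\top}$), and the dimensions add up to $n$. A minimal sketch, again assuming the illustrative sphere constraint used above:

```python
import numpy as np
from scipy.linalg import null_space

def Dh(x):
    return np.array([2.0 * x])     # Jacobian of h(x) = ||x||^2 - 1 in R^3

x_star = np.array([0.0, 0.0, 1.0])
T_basis = null_space(Dh(x_star))   # basis of the tangent space T(x_star)
N_basis = Dh(x_star).T             # basis of the normal space N(x_star)
print(np.allclose(N_basis.T @ T_basis, 0.0))   # True: T and N are orthogonal
print(T_basis.shape[1] + N_basis.shape[1])     # 3 = n: they are orthogonal complements
```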

Lagrange Condition

To better understand the idea underlying Lagrange's theorem, stated later in this section, we first consider functions of two variables and only one equality constraint. Let $h: \mathbb{R}^{2} \rightarrow \mathbb{R}$ be the constraint function. Recall that at each point $\boldsymbol{x}$ of the domain, the gradient vector $\nabla h(\boldsymbol{x})$ is orthogonal to the level set that passes through that point. Indeed, let us choose a point $\boldsymbol{x}^{*}=\left[x_{1}^{*}, x_{2}^{*}\right]^{\top}$ such that $h\left(\boldsymbol{x}^{*}\right)=0$, and assume that $\nabla h\left(\boldsymbol{x}^{*}\right) \neq \mathbf{0}$. The level set through the point $\boldsymbol{x}^{*}$ is the set $\{\boldsymbol{x}: h(\boldsymbol{x})=0\}$. We then parameterize this level set in a neighborhood of $\boldsymbol{x}^{*}$ by a curve $\{\boldsymbol{x}(t)\}$, that is, a continuously differentiable vector function $\boldsymbol{x}: \mathbb{R} \rightarrow \mathbb{R}^{2}$ such that

$$\boldsymbol{x}(t)=\left[\begin{array}{l} x_{1}(t) \\ x_{2}(t) \end{array}\right], \quad t \in(a, b), \quad \boldsymbol{x}^{*}=\boldsymbol{x}\left(t^{*}\right), \quad \dot{\boldsymbol{x}}\left(t^{*}\right) \neq \mathbf{0}, \quad t^{*} \in(a, b).$$

We can now show that $\nabla h\left(\boldsymbol{x}^{*}\right)$ is orthogonal to $\dot{\boldsymbol{x}}\left(t^{*}\right)$. Indeed, because $h$ is constant on the curve $\{\boldsymbol{x}(t): t \in(a, b)\}$, we have that for all $t \in(a, b)$,

$$h(\boldsymbol{x}(t))=0.$$

Hence, for all $t \in(a, b)$,

$$\frac{d}{d t} h(\boldsymbol{x}(t))=0.$$

Applying the chain rule, we get

$$\frac{d}{d t} h(\boldsymbol{x}(t))=\nabla h(\boldsymbol{x}(t))^{\top} \dot{\boldsymbol{x}}(t)=0.$$

Therefore, $\nabla h\left(\boldsymbol{x}^{*}\right)$ is orthogonal to $\dot{\boldsymbol{x}}\left(t^{*}\right)$.

Now suppose that $\boldsymbol{x}^{*}$ is a minimizer of $f: \mathbb{R}^{2} \rightarrow \mathbb{R}$ on the set $\{\boldsymbol{x}: h(\boldsymbol{x})=0\}$. We claim that $\nabla f\left(\boldsymbol{x}^{*}\right)$ is orthogonal to $\dot{\boldsymbol{x}}\left(t^{*}\right)$. To see this, it is enough to observe that the composite function of $t$ given by

$$\phi(t)=f(\boldsymbol{x}(t))$$

achieves a minimum at $t^{*}$. Consequently, the first-order necessary condition for the unconstrained extremum problem implies that

$$\frac{d \phi}{d t}\left(t^{*}\right)=0.$$

Applying the chain rule yields

$$0=\frac{d}{d t} \phi\left(t^{*}\right)=\nabla f\left(\boldsymbol{x}\left(t^{*}\right)\right)^{\top} \dot{\boldsymbol{x}}\left(t^{*}\right)=\nabla f\left(\boldsymbol{x}^{*}\right)^{\top} \dot{\boldsymbol{x}}\left(t^{*}\right).$$

Thus, $\nabla f\left(\boldsymbol{x}^{*}\right)$ is orthogonal to $\dot{\boldsymbol{x}}\left(t^{*}\right)$. The fact that $\dot{\boldsymbol{x}}\left(t^{*}\right)$ is tangent to the curve $\{\boldsymbol{x}(t)\}$ at $\boldsymbol{x}^{*}$ means that $\nabla f\left(\boldsymbol{x}^{*}\right)$ is orthogonal to the curve at $\boldsymbol{x}^{*}$ (see Figure 20.10).

Figure 20.10 The gradient $\nabla f\left(\boldsymbol{x}^{*}\right)$ is orthogonal to the curve $\{\boldsymbol{x}(t)\}$ at the point $\boldsymbol{x}^{*}$ that is a minimizer of $f$ on the curve.

Recall that $\nabla h\left(\boldsymbol{x}^{*}\right)$ is also orthogonal to $\dot{\boldsymbol{x}}\left(t^{*}\right)$. Therefore, the vectors $\nabla h\left(\boldsymbol{x}^{*}\right)$ and $\nabla f\left(\boldsymbol{x}^{*}\right)$ are parallel; that is, $\nabla f\left(\boldsymbol{x}^{*}\right)$ is a scalar multiple of $\nabla h\left(\boldsymbol{x}^{*}\right)$. The observations above allow us now to formulate Lagrange's theorem for functions of two variables with one constraint.
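A minimal numerical sketch of this geometric picture, assuming the illustrative problem of minimizing $f(\boldsymbol{x})=x_{1}+x_{2}$ on the unit circle (not an example from the text):

```python
import numpy as np

# On the unit circle x(t) = [cos t, sin t], f(x(t)) = cos t + sin t is
# minimized at t* = 5*pi/4.  Both grad f and grad h are orthogonal to the
# tangent vector x_dot(t*) there, hence parallel to each other.
grad_f = np.array([1.0, 1.0])                        # gradient of f(x) = x1 + x2
t_star = 5.0 * np.pi / 4.0
x_star = np.array([np.cos(t_star), np.sin(t_star)])
x_dot = np.array([-np.sin(t_star), np.cos(t_star)]) # tangent to the curve at x*
grad_h = 2.0 * x_star                                # gradient of h(x) = ||x||^2 - 1

print(np.isclose(grad_f @ x_dot, 0.0))   # True: grad f is orthogonal to the curve
print(np.isclose(grad_h @ x_dot, 0.0))   # True: grad h is orthogonal to the curve
```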

Theorem (Lagrange's Theorem for $n=2$, $m=1$)

Let the point $\boldsymbol{x}^{*}$ be a minimizer of $f: \mathbb{R}^{2} \rightarrow \mathbb{R}$ subject to the constraint $h(\boldsymbol{x})=0$, $h: \mathbb{R}^{2} \rightarrow \mathbb{R}$. Then, $\nabla f\left(\boldsymbol{x}^{*}\right)$ and $\nabla h\left(\boldsymbol{x}^{*}\right)$ are parallel. That is, if $\nabla h\left(\boldsymbol{x}^{*}\right) \neq \mathbf{0}$, then there exists a scalar $\lambda^{*}$ such that

$$\nabla f\left(\boldsymbol{x}^{*}\right)+\lambda^{*} \nabla h\left(\boldsymbol{x}^{*}\right)=\mathbf{0}.$$

We refer to $\lambda^{*}$ as the Lagrange multiplier. Note that the theorem also holds for maximizers. Figure 20.11 gives an illustration of Lagrange's theorem for the case where $\boldsymbol{x}^{*}$ is a maximizer of $f$ over the set $\{\boldsymbol{x}: h(\boldsymbol{x})=0\}$.

Lagrange's theorem provides a first-order necessary condition for a point to be a local minimizer. This condition, which we call the Lagrange condition, consists of two equations:

$$\begin{aligned} \nabla f\left(\boldsymbol{x}^{*}\right)+\lambda^{*} \nabla h\left(\boldsymbol{x}^{*}\right) &=\mathbf{0}, \\ h\left(\boldsymbol{x}^{*}\right) &=0. \end{aligned}$$
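The Lagrange condition is a system of $n+1$ equations in the $n+1$ unknowns $\left(\boldsymbol{x}^{*}, \lambda^{*}\right)$, so it can be solved numerically. A minimal sketch, assuming the illustrative problem of minimizing $f(\boldsymbol{x})=x_{1}+x_{2}$ subject to $h(\boldsymbol{x})=x_{1}^{2}+x_{2}^{2}-2=0$ (not an example from the text):

```python
import numpy as np
from scipy.optimize import fsolve

def lagrange_condition(v):
    x1, x2, lam = v
    return [1.0 + lam * 2.0 * x1,    # first component of grad f + lambda * grad h
            1.0 + lam * 2.0 * x2,    # second component
            x1**2 + x2**2 - 2.0]     # the constraint h(x) = 0

# Starting near the minimizer; a start near (1, 1) finds the other root,
# which is the maximizer.
sol = fsolve(lagrange_condition, [-1.0, -1.0, 1.0])
print(np.round(sol, 6))              # [-1. -1.  0.5]: x* = (-1, -1), lambda* = 1/2
```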

Note that the Lagrange condition is necessary but not sufficient. In Figure 20.12 we illustrate a variety of points where the Lagrange condition is satisfied, including a case where the point is not an extremizer (neither a maximizer nor a minimizer).

Figure 20.11 Lagrange's theorem for $n=2$, $m=1$.


We now generalize Lagrange's theorem to the case where $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ and $\boldsymbol{h}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$, $m \leq n$.

Lagrange's Theorem

Let $\boldsymbol{x}^{*}$ be a local minimizer (or maximizer) of $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$, subject to $\boldsymbol{h}(\boldsymbol{x})=\mathbf{0}$, $\boldsymbol{h}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$, $m \leq n$. Assume that $\boldsymbol{x}^{*}$ is a regular point. Then, there exists $\boldsymbol{\lambda}^{*} \in \mathbb{R}^{m}$ such that

$$D f\left(\boldsymbol{x}^{*}\right)+\boldsymbol{\lambda}^{* \top} D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)=\mathbf{0}^{\top}.$$
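Given a candidate $\boldsymbol{x}^{*}$, the multiplier $\boldsymbol{\lambda}^{*}$ solves the linear system $D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)^{\top} \boldsymbol{\lambda}=-\nabla f\left(\boldsymbol{x}^{*}\right)$, which has a solution exactly when $\nabla f\left(\boldsymbol{x}^{*}\right)$ lies in the range of $D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)^{\top}$. A minimal sketch, assuming the illustrative problem of minimizing $\|\boldsymbol{x}\|^{2}$ on the plane $x_{1}+x_{2}+x_{3}=3$ (with minimizer $\boldsymbol{x}^{*}=(1,1,1)^{\top}$, not an example from the text):

```python
import numpy as np

grad_f = lambda x: 2.0 * x                   # gradient of f(x) = ||x||^2
Dh = lambda x: np.array([[1.0, 1.0, 1.0]])   # Jacobian of h(x) = x1 + x2 + x3 - 3

x_star = np.array([1.0, 1.0, 1.0])
# Solve Dh(x*)^T lambda = -grad f(x*) by least squares; a zero residual
# means the Lagrange condition holds at x*.
lam, *_ = np.linalg.lstsq(Dh(x_star).T, -grad_f(x_star), rcond=None)
print(lam)                                                 # [-2.]
print(np.allclose(Dh(x_star).T @ lam, -grad_f(x_star)))    # True
```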

Proof

We need to prove that

$$\nabla f\left(\boldsymbol{x}^{*}\right)=-D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)^{\top} \boldsymbol{\lambda}^{*}$$

for some $\boldsymbol{\lambda}^{*} \in \mathbb{R}^{m}$; that is, $\nabla f\left(\boldsymbol{x}^{*}\right) \in \mathcal{R}\left(D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)^{\top}\right)=N\left(\boldsymbol{x}^{*}\right)$. But by Lemma 20.1, $N\left(\boldsymbol{x}^{*}\right)=T\left(\boldsymbol{x}^{*}\right)^{\perp}$. Therefore, it remains to show that $\nabla f\left(\boldsymbol{x}^{*}\right) \in T\left(\boldsymbol{x}^{*}\right)^{\perp}$.

We proceed as follows. Suppose that

$$\boldsymbol{y} \in T\left(\boldsymbol{x}^{*}\right).$$

Then, by Theorem 20.1, there exists a differentiable curve $\{\boldsymbol{x}(t): t \in(a, b)\}$ such that for all $t \in(a, b)$,

$$\boldsymbol{h}(\boldsymbol{x}(t))=\mathbf{0},$$

and there exists $t^{*} \in(a, b)$ satisfying

$$\boldsymbol{x}\left(t^{*}\right)=\boldsymbol{x}^{*}, \quad \dot{\boldsymbol{x}}\left(t^{*}\right)=\boldsymbol{y}.$$

Now consider the composite function $\phi(t)=f(\boldsymbol{x}(t))$. Note that $t^{*}$ is a local minimizer of this function. By the first-order necessary condition for unconstrained local minimizers (see Theorem 6.1),

$$\frac{d \phi}{d t}\left(t^{*}\right)=0.$$

Applying the chain rule yields

$$\frac{d \phi}{d t}\left(t^{*}\right)=D f\left(\boldsymbol{x}^{*}\right) \dot{\boldsymbol{x}}\left(t^{*}\right)=D f\left(\boldsymbol{x}^{*}\right) \boldsymbol{y}=\nabla f\left(\boldsymbol{x}^{*}\right)^{\top} \boldsymbol{y}=0.$$

So all $\boldsymbol{y} \in T\left(\boldsymbol{x}^{*}\right)$ satisfy

$$\nabla f\left(\boldsymbol{x}^{*}\right)^{\top} \boldsymbol{y}=0;$$

that is,

$$\nabla f\left(\boldsymbol{x}^{*}\right) \in T\left(\boldsymbol{x}^{*}\right)^{\perp}.$$

This completes the proof.

Figure 20.13 Example where the Lagrange condition does not hold.

Theorem (Second-Order Necessary Conditions)

Let $\boldsymbol{x}^{*}$ be a local minimizer of $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ subject to $\boldsymbol{h}(\boldsymbol{x})=\mathbf{0}$, $\boldsymbol{h}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$, $m \leq n$, and $f, \boldsymbol{h} \in \mathcal{C}^{2}$. Suppose that $\boldsymbol{x}^{*}$ is regular. Then, there exists $\boldsymbol{\lambda}^{*} \in \mathbb{R}^{m}$ such that:

  1. $D f\left(\boldsymbol{x}^{*}\right)+\boldsymbol{\lambda}^{* \top} D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)=\mathbf{0}^{\top}$.

  2. For all $\boldsymbol{y} \in T\left(\boldsymbol{x}^{*}\right)$, we have $\boldsymbol{y}^{\top} \boldsymbol{L}\left(\boldsymbol{x}^{*}, \boldsymbol{\lambda}^{*}\right) \boldsymbol{y} \geq 0$.

Here, $\boldsymbol{L}(\boldsymbol{x}, \boldsymbol{\lambda})=\boldsymbol{F}(\boldsymbol{x})+\left[\boldsymbol{\lambda} \boldsymbol{H}(\boldsymbol{x})\right]$, where $\boldsymbol{F}(\boldsymbol{x})$ is the Hessian matrix of $f$ at $\boldsymbol{x}$ and $\left[\boldsymbol{\lambda} \boldsymbol{H}(\boldsymbol{x})\right]=\lambda_{1} \boldsymbol{H}_{1}(\boldsymbol{x})+\cdots+\lambda_{m} \boldsymbol{H}_{m}(\boldsymbol{x})$, with $\boldsymbol{H}_{k}(\boldsymbol{x})$ the Hessian matrix of $h_{k}$ at $\boldsymbol{x}$.

Proof. The existence of $\boldsymbol{\lambda}^{*} \in \mathbb{R}^{m}$ such that $D f\left(\boldsymbol{x}^{*}\right)+\boldsymbol{\lambda}^{* \top} D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)=\mathbf{0}^{\top}$ follows from Lagrange's theorem. It remains to prove the second part of the result. Suppose that $\boldsymbol{y} \in T\left(\boldsymbol{x}^{*}\right)$; that is, $\boldsymbol{y}$ belongs to the tangent space to $S=\left\{\boldsymbol{x} \in \mathbb{R}^{n}: \boldsymbol{h}(\boldsymbol{x})=\mathbf{0}\right\}$ at $\boldsymbol{x}^{*}$. Because $\boldsymbol{h} \in \mathcal{C}^{2}$, following the argument of Theorem 20.1, there exists a twice-differentiable curve $\{\boldsymbol{x}(t): t \in(a, b)\}$ on $S$ such that

$$\boldsymbol{x}\left(t^{*}\right)=\boldsymbol{x}^{*}, \quad \dot{\boldsymbol{x}}\left(t^{*}\right)=\boldsymbol{y}$$

for some $t^{*} \in(a, b)$. Observe that by assumption, $t^{*}$ is a local minimizer of the function $\phi(t)=f(\boldsymbol{x}(t))$. From the second-order necessary condition for unconstrained minimization (see Theorem 6.2), we obtain

$$\frac{d^{2} \phi}{d t^{2}}\left(t^{*}\right) \geq 0.$$

Using the formula

$$\frac{d}{d t}\left(\boldsymbol{y}(t)^{\top} \boldsymbol{z}(t)\right)=\boldsymbol{z}(t)^{\top} \frac{d \boldsymbol{y}}{d t}(t)+\boldsymbol{y}(t)^{\top} \frac{d \boldsymbol{z}}{d t}(t)$$

and applying the chain rule yields

$$\begin{aligned} \frac{d^{2} \phi}{d t^{2}}\left(t^{*}\right) &=\frac{d}{d t}\left[D f\left(\boldsymbol{x}\left(t^{*}\right)\right) \dot{\boldsymbol{x}}\left(t^{*}\right)\right] \\ &=\dot{\boldsymbol{x}}\left(t^{*}\right)^{\top} \boldsymbol{F}\left(\boldsymbol{x}^{*}\right) \dot{\boldsymbol{x}}\left(t^{*}\right)+D f\left(\boldsymbol{x}^{*}\right) \ddot{\boldsymbol{x}}\left(t^{*}\right) \\ &=\boldsymbol{y}^{\top} \boldsymbol{F}\left(\boldsymbol{x}^{*}\right) \boldsymbol{y}+D f\left(\boldsymbol{x}^{*}\right) \ddot{\boldsymbol{x}}\left(t^{*}\right) \geq 0. \end{aligned}$$

Because $\boldsymbol{h}(\boldsymbol{x}(t))=\mathbf{0}$ for all $t \in(a, b)$, we have

$$\frac{d^{2}}{d t^{2}} \boldsymbol{\lambda}^{* \top} \boldsymbol{h}(\boldsymbol{x}(t))=0.$$

Thus, for all $t \in(a, b)$,

$$\begin{aligned} \frac{d^{2}}{d t^{2}} \boldsymbol{\lambda}^{* \top} \boldsymbol{h}(\boldsymbol{x}(t)) &=\frac{d}{d t}\left[\boldsymbol{\lambda}^{* \top} \frac{d}{d t} \boldsymbol{h}(\boldsymbol{x}(t))\right] \\ &=\frac{d}{d t}\left[\sum_{k=1}^{m} \lambda_{k}^{*} \frac{d}{d t} h_{k}(\boldsymbol{x}(t))\right] \\ &=\frac{d}{d t}\left[\sum_{k=1}^{m} \lambda_{k}^{*} D h_{k}(\boldsymbol{x}(t)) \dot{\boldsymbol{x}}(t)\right] \\ &=\sum_{k=1}^{m} \lambda_{k}^{*} \frac{d}{d t}\left(D h_{k}(\boldsymbol{x}(t)) \dot{\boldsymbol{x}}(t)\right) \\ &=\sum_{k=1}^{m} \lambda_{k}^{*}\left[\dot{\boldsymbol{x}}(t)^{\top} \boldsymbol{H}_{k}(\boldsymbol{x}(t)) \dot{\boldsymbol{x}}(t)+D h_{k}(\boldsymbol{x}(t)) \ddot{\boldsymbol{x}}(t)\right] \\ &=\dot{\boldsymbol{x}}(t)^{\top}\left[\boldsymbol{\lambda}^{*} \boldsymbol{H}(\boldsymbol{x}(t))\right] \dot{\boldsymbol{x}}(t)+\boldsymbol{\lambda}^{* \top} D \boldsymbol{h}(\boldsymbol{x}(t)) \ddot{\boldsymbol{x}}(t) \\ &=0. \end{aligned}$$

In particular, the above is true for $t=t^{*}$; that is,

$$\boldsymbol{y}^{\top}\left[\boldsymbol{\lambda}^{*} \boldsymbol{H}\left(\boldsymbol{x}^{*}\right)\right] \boldsymbol{y}+\boldsymbol{\lambda}^{* \top} D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right) \ddot{\boldsymbol{x}}\left(t^{*}\right)=0.$$

Adding this equation to the inequality

$$\boldsymbol{y}^{\top} \boldsymbol{F}\left(\boldsymbol{x}^{*}\right) \boldsymbol{y}+D f\left(\boldsymbol{x}^{*}\right) \ddot{\boldsymbol{x}}\left(t^{*}\right) \geq 0$$

yields

$$\boldsymbol{y}^{\top}\left(\boldsymbol{F}\left(\boldsymbol{x}^{*}\right)+\left[\boldsymbol{\lambda}^{*} \boldsymbol{H}\left(\boldsymbol{x}^{*}\right)\right]\right) \boldsymbol{y}+\left(D f\left(\boldsymbol{x}^{*}\right)+\boldsymbol{\lambda}^{* \top} D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)\right) \ddot{\boldsymbol{x}}\left(t^{*}\right) \geq 0.$$

But, by Lagrange's theorem, $D f\left(\boldsymbol{x}^{*}\right)+\boldsymbol{\lambda}^{* \top} D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)=\mathbf{0}^{\top}$. Therefore,

$$\boldsymbol{y}^{\top}\left(\boldsymbol{F}\left(\boldsymbol{x}^{*}\right)+\left[\boldsymbol{\lambda}^{*} \boldsymbol{H}\left(\boldsymbol{x}^{*}\right)\right]\right) \boldsymbol{y}=\boldsymbol{y}^{\top} \boldsymbol{L}\left(\boldsymbol{x}^{*}, \boldsymbol{\lambda}^{*}\right) \boldsymbol{y} \geq 0,$$

which proves the result.

Observe that $\boldsymbol{L}(\boldsymbol{x}, \boldsymbol{\lambda})$ plays a role similar to that of the Hessian matrix $\boldsymbol{F}(\boldsymbol{x})$ of the objective function $f$ in the unconstrained minimization case. However, we now require that $\boldsymbol{L}\left(\boldsymbol{x}^{*}, \boldsymbol{\lambda}^{*}\right) \geq 0$ only on $T\left(\boldsymbol{x}^{*}\right)$ rather than on all of $\mathbb{R}^{n}$.
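A minimal sketch of this distinction, assuming the illustrative problem of minimizing $f(\boldsymbol{x})=x_{1}^{2}-x_{2}^{2}$ subject to $h(\boldsymbol{x})=x_{2}=0$ (not from the text): the minimizer is $\boldsymbol{x}^{*}=\mathbf{0}$ with $\lambda^{*}=0$, and $\boldsymbol{L}\left(\boldsymbol{x}^{*}, \lambda^{*}\right)=\boldsymbol{F}\left(\boldsymbol{x}^{*}\right)$ is indefinite on $\mathbb{R}^{2}$ yet positive definite on $T\left(\boldsymbol{x}^{*}\right)=\left\{\boldsymbol{y}: y_{2}=0\right\}$:

```python
import numpy as np

L = np.diag([2.0, -2.0])      # L(x*, lambda*) = F(x*), since lambda* = 0
print(np.linalg.eigvalsh(L))  # [-2.  2.]: indefinite on all of R^2
y = np.array([1.0, 0.0])      # basis vector of T(x*), the nullspace of Dh = [0, 1]
print(y @ L @ y)              # 2.0 > 0: positive definite on T(x*)
```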

The conditions above are necessary, but not sufficient, for a point to be a local minimizer. We now present, without proof, sufficient conditions for a point to be a strict local minimizer.

Theorem (Second-Order Sufficient Conditions)

Suppose that $f, \boldsymbol{h} \in \mathcal{C}^{2}$ and there exist a point $\boldsymbol{x}^{*} \in \mathbb{R}^{n}$ and $\boldsymbol{\lambda}^{*} \in \mathbb{R}^{m}$ such that:

  1. $D f\left(\boldsymbol{x}^{*}\right)+\boldsymbol{\lambda}^{* \top} D \boldsymbol{h}\left(\boldsymbol{x}^{*}\right)=\mathbf{0}^{\top}$.
  2. For all $\boldsymbol{y} \in T\left(\boldsymbol{x}^{*}\right)$, $\boldsymbol{y} \neq \mathbf{0}$, we have $\boldsymbol{y}^{\top} \boldsymbol{L}\left(\boldsymbol{x}^{*}, \boldsymbol{\lambda}^{*}\right) \boldsymbol{y}>0$.

Then, $\boldsymbol{x}^{*}$ is a strict local minimizer of $f$ subject to $\boldsymbol{h}(\boldsymbol{x})=\mathbf{0}$.
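A minimal sketch of the full sufficiency check, continuing the illustrative problem $f(\boldsymbol{x})=x_{1}+x_{2}$, $h(\boldsymbol{x})=x_{1}^{2}+x_{2}^{2}-2$ with the Lagrange solution $\boldsymbol{x}^{*}=(-1,-1)^{\top}$, $\lambda^{*}=1/2$ found earlier:

```python
import numpy as np
from scipy.linalg import null_space

x_star = np.array([-1.0, -1.0])
lam = 0.5
grad_f = np.array([1.0, 1.0])          # gradient of f
Dh = np.array([2.0 * x_star])          # Jacobian of h at x*
print(np.allclose(grad_f + lam * Dh[0], 0.0))   # True: condition 1 holds

F = np.zeros((2, 2))                   # Hessian of f (f is linear)
H = 2.0 * np.eye(2)                    # Hessian of h
L = F + lam * H                        # L(x*, lambda*)
Z = null_space(Dh)                     # columns span the tangent space T(x*)
print(np.all(np.linalg.eigvalsh(Z.T @ L @ Z) > 0))
# True: y^T L y > 0 for all nonzero y in T(x*), so condition 2 holds and
# x* is a strict local minimizer.
```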